Abstract

This report provides an overview of the findings and software that have evolved from the "Symbolic MT with Statistical
NLP Components" project over the last year. We present the major goals that have been achieved and discuss some of
the open issues that we intend to address in the near future. This report also contains details on the usage of the
software that was implemented during the project.
OTHER SENIOR PERSONNEL (Ph.D.):
POSTDOCS:

Nizar Habash, Postdoctoral Researcher, University of Maryland, habash@umiacs.umd.edu
STUDENTS:

Necip Fazil Ayan, University of Maryland, nfa@umiacs.umd.edu

Nitin Madnani, University of Maryland, nmadnani@umiacs.umd.edu
2 COLLABORATIONS (BROADLY CONCEIVED)

1. Presentations by Bonnie Dorr to Georgetown on the use of linguistic information in hybrid statistical/symbolic
tasks (summarization, machine translation, divergence unraveling).

2. Collaboration with Philip Resnik for the JHU/ONR MURI project.

3 PROJECT FINDINGS

1. Translation divergences occur frequently: Examination of the Spanish-English parallel corpus shows that
divergences occur in 35% of the sentence pairs.

2. Unraveling the divergences with linguistically motivated universal rules results in improved word-level alignments:
Experiments on Spanish-English alignments show statistically significant improvements of DUSTer compared to
a state-of-the-art statistical aligner (GIZA++).

3. Generation-Heavy Machine Translation has a higher degree of robustness than a statistical translation system
(IBM-4) and scores higher when the test set is not from the same genre as the data the statistical system was trained on.
4 OPPORTUNITIES FOR TRAINING AND DEVELOPMENT (AT ALL GRADE LEVELS)


The improved word-alignment corpora can be used for Machine Translation in a number of ways:

1. Improved translation dictionary extraction

2. Improved statistical machine translation

In addition, the universal rules help to identify sentences that contain divergences, and this information can be used to
exclude them from parallel texts that are used to train a statistical aligner. This results in a less complex corpus, on
which standard statistical aligners can be trained. This is currently being examined by Dr. Philip Resnik for Chinese-English
translation in the context of the MURI project.
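
The filtering idea described above can be sketched in a few lines of Python. This is only an illustration, not part of the DUSTer distribution: the report does not describe how the universal rules are invoked, so `is_divergent` is a hypothetical callback standing in for them, and the toy detector in the usage note is invented purely for demonstration.

```python
from typing import Callable, Iterable, List, Tuple

# A sentence pair is (English sentence, foreign-language sentence).
SentencePair = Tuple[str, str]


def split_by_divergence(
    pairs: Iterable[SentencePair],
    is_divergent: Callable[[SentencePair], bool],
) -> Tuple[List[SentencePair], List[SentencePair]]:
    """Split a parallel corpus into (kept, excluded) sentence pairs.

    `is_divergent` is a hypothetical stand-in for the rule-based
    divergence detector; pairs it flags are excluded from the
    training data for the statistical aligner.
    """
    kept: List[SentencePair] = []
    excluded: List[SentencePair] = []
    for pair in pairs:
        if is_divergent(pair):
            excluded.append(pair)
        else:
            kept.append(pair)
    return kept, excluded
```

A caller would pass the corpus as a list of pairs together with whatever detector is available, for example `split_by_divergence(pairs, lambda p: "gusta" in p[1].split())` with a toy detector, and then train the aligner on the `kept` portion only.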

The GHMT system has recently been adapted to Chinese, and at this point we are also adapting it to Arabic. GHMT is
also used for cross-lingual summarization, where summarization and translation are fully integrated.

- OUTREACH ACTIVITIES (DEFINED TO BE OUTSIDE OUR PROFESSIONAL COMMUNITIES)

5 PUBLICATIONS AND PRODUCTS

5.1 JOURNAL/CONFERENCE PUBLICATIONS

Bonnie Dorr, Necip Fazil Ayan, and Nizar Habash. "Divergence Unraveling for Word Alignment of Parallel
Corpora". Submitted to Natural Language Engineering.

Bonnie Dorr, Necip Fazil Ayan, Nizar Habash, Nitin Madnani, and Rebecca Hwa, "Rapid Porting of DUSTer to
Hindi", ACM Transactions on Asian Language Information Processing (TALIP), 2:3, 2003.

Nizar Habash and Bonnie Dorr, "CatVar: A Database of Categorial Variations for English", in Proceedings of the
MT Summit, New Orleans, LA, pp. 471-474, 2003.

Nizar Habash and Bonnie Dorr, "A Categorial Variation Database for English", Proceedings of North American
Association for Computational Linguistics, Edmonton, Canada, pp. 96-102, 2003.

Nizar Habash. "Matador: A large scale Spanish-English GHMT system". In Proceedings of the MT-Summit,
pages 149-156, 2003.

Habash, Nizar, Bonnie J. Dorr, and David Traum, "Hybrid Natural Language Generation from Lexical Conceptual
Structures", Machine Translation, 18:2, 2003.
5.2 ONE-TIME PUBLICATIONS (INCLUDES BOOK CHAPTERS AND DISSERTATIONS)

Nizar Habash. Generation-Heavy Hybrid Machine Translation. PhD Thesis, University of Maryland, 2003.

Necip Fazil Ayan. Injecting Linguistic Information to Improve Word Alignments for Statistical MT Systems. PhD
Research Proposal, University of Maryland, 2004.


6 OTHER PRODUCTS

1. DUSTer: A word aligner that combines statistical information and linguistic rules for divergence unraveling.
Download: http://clipdemos.umiacs.umd.edu/duster/duster.tar.gz

Installation:

gunzip duster.tar.gz
tar -xf duster.tar

Documentation:

DUSTer-Package/docs/README

2. CatVar: A Categorial Variation Database for English. CatVar is an extensive database of morphological
variation for English. CatVar is integrated into GHMT in order to increase the flexibility of the generation
of the English translation. Online demo: http://clipdemos.umiacs.umd.edu/catvar/

3. Parallel corpora for Spanish-English and Hindi-English with DUSTer word-level alignments.

4. Generation-Heavy Machine Translation system (GHMT). Currently, GHMT supports Spanish-to-English and
Chinese-to-English translation. Download: http://clipdemos.umiacs.umd.edu/ghmt/GHMT-PAK.tar.gz

Installation:

gunzip GHMT-PAK.tar.gz
tar -xf GHMT-PAK.tar

Documentation:

GHMT: GHMT-PAK/GHMT/install.readme
7 CONTRIBUTIONS


7.1 CONTRIBUTIONS WITHIN THE DISCIPLINE

1. An extensive database of morphological variation for English.

2. Improved word mappings between Spanish and English and between Hindi and English.

3. A robust Machine Translation system from Spanish to English and Chinese to English.

7.2 CONTRIBUTIONS TO OTHER DISCIPLINES (THIS IS NOT EXPECTED FROM ALL PROJECTS)

The project is relevant to the augmentation of capabilities useful for intelligence analysts, such as cross-lingual
summarization and data mining.

In addition, the corpora with improved word-level alignments can be used for general resource projection from English
onto the foreign part of a parallel corpus.

- CONTRIBUTIONS TO THE DEVELOPMENT OF HUMAN RESOURCES (SPECIFIC FOCUS ON RESEARCH
OPPORTUNITIES, UNDERREPRESENTED GROUPS, EDUCATIONAL MATERIALS, AND MEMBERS
OF THE PUBLIC)

7.3 CONTRIBUTIONS TO RESOURCES FOR RESEARCH

This work provides an integral component for many NLP applications that require cross-lingual processing or processing in a
resource-poor foreign language.

7.4 CONTRIBUTIONS BEYOND SCIENCE AND ENGINEERING (THESE CAN BE SPECULATIVE)

The research carried out in this project contributes to the development of better-performing machine translation systems.
The availability of high-performance MT has far-reaching consequences for society in general, as it helps laypeople
and professionals access information that is authored in a language they do not understand.


8 PLANS FOR THE NEXT YEAR, IF CHANGED


Recently, we have designed a prototype that will allow us to use many different resources, such as statistical aligners,
linguistic rules, cognate lists, and dictionaries, and combine their partial evidence to yield more accurate word-level
alignments.

Additionally, we will adapt our GHMT implementation to Arabic-English translation.

Our funding for this project ends in the summer of 2005. We will need additional funds for the 2 years after the project
has expired to sustain the high level of activity that we have devoted to this effort over the last year. For this,
we have one proposal currently under review:

"Divergence Resolution for Interlingual Variation Encoding (DRIVE)", REFLEX submission. Broad Agency
Announcement (BAA-04-01-FH), May 2004.

9 SPECIAL REPORTING REQUIREMENTS, IF ANY

None.

10 UNOBLIGATED FUNDS (ONLY IF OVER 20%)

N/A

11 SIGNIFICANT CHANGE IN USE OF HUMAN SUBJECTS

None.

A DUSTer

usage: end-to-end.pl <argument-file>

end-to-end.pl processes an entire corpus of two languages and projects dependency trees and alignments from one corpus
to another, using DUSTer. (See README.overview for more information on what DUSTer is.) The only argument to
the script is a file specifying paths for external scripts and programs, and the arguments for DUSTer.
In order to run DUSTer correctly, please make sure that you put the following into your shell startup file (your .tcshrc file
or whatever you use):

setenv DUSTERPATH <duster-path>
set path = ( $path $DUSTERPATH/bin )

DUSTERPATH should be set to the directory that contains the pos, rel, bin, epgen, lib, and docs directories. Without these
statements and the correct setting of DUSTERPATH, DUSTer will not work! (After installing DUSTer on your system,
make sure that you set DUSTERPATH correctly.)
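
As a quick sanity check of the setting described above, a small script along the following lines could confirm that DUSTERPATH points at a directory containing the six expected subdirectories. This is a sketch in modern Python 3 and is not shipped with DUSTer; the function name `missing_duster_dirs` is made up for illustration.

```python
import os
from typing import List, Optional

# Subdirectories that DUSTERPATH is expected to contain, per the text above.
REQUIRED_SUBDIRS = ("pos", "rel", "bin", "epgen", "lib", "docs")


def missing_duster_dirs(duster_path: Optional[str]) -> List[str]:
    """Return the required subdirectories that are absent under duster_path.

    If duster_path is unset or is not a directory, every required
    subdirectory is reported as missing.
    """
    if not duster_path or not os.path.isdir(duster_path):
        return list(REQUIRED_SUBDIRS)
    return [
        name
        for name in REQUIRED_SUBDIRS
        if not os.path.isdir(os.path.join(duster_path, name))
    ]


if __name__ == "__main__":
    missing = missing_duster_dirs(os.environ.get("DUSTERPATH"))
    if missing:
        print("DUSTERPATH is unset or incomplete; missing: " + ", ".join(missing))
    else:
        print("DUSTERPATH looks complete")
```

An empty result only shows that the directory layout is plausible; it does not verify that $DUSTERPATH/bin is actually on the shell path.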

EXAMPLE RUN:

In order to run DUSTer on a small English-Spanish (or English-Hindi) example, go to the examples/Spanish (or Hindi)
directory and run


end-to-end.pl end-to-end-arguments.sp
or

end-to-end.pl end-to-end-arguments.hin

If you get a comb-aligned directory under DUSTer-Results, that means you have run DUSTer successfully on this small
example.

IMPORTANT REQUIREMENTS:

1. To run runGIZA++.pl, you should define the correct paths for the GIZA++ and EGYPT packages in your shell file (i.e.,
GIZAPATH and EGYPTPATH). For example,

setenv GIZAPATH /dfs/projects/clip-proj/IBM-MT/bin
set path = ($GIZAPATH $path)

setenv EGYPTPATH /fs/clip-archive/connor/Corpora/not-really-corpora/cvs-exports/EGYPT/bin
set path = ($EGYPTPATH $path)

Make sure you have these programs on your machines and set these environment variables correctly.

2. To run the Python files, you should have a correct pointer to Python version 2.3 (see the PYTHON variable in the argument file).

ARGUMENTS TO END-TO-END.PL

The only argument to the script is a file specifying paths for external scripts and programs, and the arguments for DUSTer.
Here is an example argument file (which is under examples/Spanish/end-to-end-arguments.sp). Please make sure that
the argument file contains a value for each of these variables. (You can simply copy this file and change the values in the
second column appropriately.)

Program Paths

EPEXTRACT                  $DUSTERPATH/lib/epextract.py
COMBALIGN                  $DUSTERPATH/lib/combalign.py
GIZA2INFERDEPGRAPH         $DUSTERPATH/utils/giza2infer.pl
RUN_GIZA                   $DUSTERPATH/utils/runGIZA++.pl

Make sure you change this to the path for Python 2.3 or a later version on your machine. The following is just an
example path and will probably not exist on your system.

PYTHON                     /usr/local/stow/Python-2.3.2/bin/python2.3

Arguments

DUSTER_CONFIGURATION_FILE  config.Spanish
LEFT_LANGUAGE              English
RIGHT_LANGUAGE             Spanish
LEFT_CORPUS                dev.eng
RIGHT_CORPUS               dev.sp
DEPENDENCY_TREE_DIR        dev.ecollins.dep-dir
ALIGNMENT_FILES_DIR        dev.giza.alignments-dir
OUTPUT_DIR                 DUSTer_Results
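
Since every variable must have a value, it can help to check an argument file before launching a run. The sketch below, which is not part of DUSTer, assumes each setting is a variable name followed by whitespace and a value on one line, and it ignores any line that does not begin with one of the expected names. The variable names are as read from the example file above (with underscores); the function names are hypothetical.

```python
from typing import Dict, List

# Variable names as they appear in the example argument file (assumed spelling).
REQUIRED_VARIABLES = (
    "EPEXTRACT", "COMBALIGN", "GIZA2INFERDEPGRAPH", "RUN_GIZA", "PYTHON",
    "DUSTER_CONFIGURATION_FILE", "LEFT_LANGUAGE", "RIGHT_LANGUAGE",
    "LEFT_CORPUS", "RIGHT_CORPUS", "DEPENDENCY_TREE_DIR",
    "ALIGNMENT_FILES_DIR", "OUTPUT_DIR",
)


def parse_argument_text(text: str) -> Dict[str, str]:
    """Collect 'NAME value' settings, skipping headings and blank lines."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2 and parts[0] in REQUIRED_VARIABLES:
            settings[parts[0]] = parts[1].strip()
    return settings


def missing_variables(settings: Dict[str, str]) -> List[str]:
    """Return the required variable names that have no value."""
    return [name for name in REQUIRED_VARIABLES if name not in settings]
```

Reading the file with `open(path).read()` and passing the text to `parse_argument_text` gives a dictionary of settings; a non-empty result from `missing_variables` indicates an incomplete file.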

The first four variables in the argument file correspond to external programs distributed with this package. You
don't need to change those lines unless you move those scripts to other directories.

The PYTHON variable should be set to the path for Python version 2.3 or later. Otherwise, the Python scripts will not work.

The rest of the arguments depend on the corpus you are running DUSTer on. DUSTER_CONFIGURATION_FILE is
the file where DUSTer-specific paths and variables are set. This must be done once for each language pair, and then it can
be used on every parallel corpus in those two languages. For Hindi and Spanish, example configuration files are provided
in the example directory (config.Spanish and config.hindi). For more information about this configuration file, please see
README.auto-DUSTer.

LEFT_CORPUS refers to the English corpus and RIGHT_CORPUS refers to the foreign language corpus. These should
contain one sentence per line and they should be parallel (i.e., the sentences on the same line are translations of each
other).
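
A necessary (though not sufficient) condition for the two corpus files to be parallel is that they have the same number of lines. The following Python sketch, which is not part of DUSTer and uses made-up function names, checks just that; it cannot tell whether the sentences on matching lines are really translations of each other.

```python
from typing import Tuple


def count_lines(path: str) -> int:
    """Count the lines (one sentence per line) in a corpus file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def corpus_line_counts(left_corpus: str, right_corpus: str) -> Tuple[int, int]:
    """Return (number of left sentences, number of right sentences)."""
    return count_lines(left_corpus), count_lines(right_corpus)


def corpora_have_equal_length(left_corpus: str, right_corpus: str) -> bool:
    """True if both corpus files contain the same number of lines."""
    left, right = corpus_line_counts(left_corpus, right_corpus)
    return left == right
```

For the Spanish example this would be called as `corpora_have_equal_length("dev.eng", "dev.sp")`.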
DEPENDENCY_TREE_DIR is the directory that contains dependency trees for English sentences. Each file MUST be
named tree.n, where n is the number of the sentence in the corpus (tree.1, tree.2, etc.). The directory name should
include the name of the parser. The current standard is <corpus-name>.<parser>.dep-dir (for example,
devdata.ecollins.dep-dir or devdata.eminipar.dep-dir). The name of the parser is important to locate the necessary mapping files. For the
formatting of each file, please see README.auto-DUSTer or the example trees under the example directory.

ALIGNMENT_FILES_DIR is the directory that contains the initial alignment files. Each file MUST be named align.n,
where n is the number of the sentence in the corpus (align.1, align.2, etc.). Each file starts with "begin n" and ends with
"end". In between, each English word is associated with a bunch of FL (foreign-language) words. Please see README.auto-DUSTer or
the example alignment files under the example directory for further details.
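
The naming conventions for the tree and alignment files lend themselves to a simple pre-flight check. The Python sketch below is not part of DUSTer. It assumes that sentence numbers start at 1 and that the "begin n" line uses the sentence number with a single space, and it deliberately does not interpret the lines between "begin n" and "end", whose format is documented in README.auto-DUSTer rather than here. Function names are hypothetical.

```python
import os
import re
from typing import List


def missing_numbered_files(directory: str, prefix: str, n_sentences: int) -> List[int]:
    """Return sentence numbers in 1..n_sentences with no '<prefix>.<n>' file."""
    pattern = re.compile(r"^%s\.(\d+)$" % re.escape(prefix))
    present = set()
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            present.add(int(match.group(1)))
    return [n for n in range(1, n_sentences + 1) if n not in present]


def alignment_file_is_framed(path: str, sentence_number: int) -> bool:
    """Check that a file starts with 'begin <n>' and ends with 'end'.

    Blank lines are ignored; the alignment lines in between are not parsed.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        lines = [line.strip() for line in handle if line.strip()]
    if not lines:
        return False
    return lines[0] == "begin %d" % sentence_number and lines[-1] == "end"
```

For a corpus of N sentences, `missing_numbered_files(tree_dir, "tree", N)` and `missing_numbered_files(align_dir, "align", N)` should both return empty lists before DUSTer is run.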

OUTPUT_DIR is the directory where all output files will be written. The sentences for which DUSTer runs successfully
will be placed under the succ directory. Otherwise, the related files for that sentence will be under the err directory. The final
alignment files (i.e., DUSTer alignments) will be placed under the succ/comb-aligned directory. Each output file in this
directory is named ehmap.n, where n is the sentence number.
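
After a run, the output layout described above makes it easy to list which sentences received a final alignment. The following Python sketch is not part of DUSTer; it relies only on the succ/comb-aligned location and the ehmap.n naming stated in the text, and the function name is invented for illustration.

```python
import os
import re
from typing import List

_EHMAP_NAME = re.compile(r"^ehmap\.(\d+)$")


def aligned_sentence_numbers(output_dir: str) -> List[int]:
    """Return the sorted sentence numbers that have an ehmap.n file.

    Looks under <output_dir>/succ/comb-aligned; returns an empty list
    if that directory does not exist.
    """
    comb_aligned = os.path.join(output_dir, "succ", "comb-aligned")
    if not os.path.isdir(comb_aligned):
        return []
    numbers = []
    for name in os.listdir(comb_aligned):
        match = _EHMAP_NAME.match(name)
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)
```

Comparing the returned list with the range of sentence numbers in the corpus shows which sentences ended up without a final alignment.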


B GHMT

REQUIRED RESOURCES

1. LISP: International Allegro CL Enterprise Edition 6.0 (Franz Inc.)

2. Perl: v5.8.0

3. Connexor parser (English and Spanish) (from www.connexor.com). See the instructions below on hooking up the
Connexor client to the rest of the system.

4. Nitrogen Morphology Support

Nitrogen is available at: http://www.isi.edu/natural-language/projects/nitrogen/

Specifically, the morphology files nitro.english.morph.lisp, nitro.morph.8.98.lisp, and nitromorph-8-98.lisp must be
placed under $PACKAGE/EXERGE/SOURCE/oxyexerge/

5. Halogen Forest Ranker

Halogen is available at: http://www.isi.edu/licensed-sw/halogen/ All code from the forest ranker should be
installed under $PACKAGE/HALOGEN/ForestRanker


Make sure the variables in sysVars.cshrc are added to your .cshrc


The source files for the Exerge system are included in this package, in addition to images created on Solaris. To remake
these images, run $PACKAGE/ake-Exerge.sh

See the sample run of Matador below.

CONNEXOR SPECIFIC INSTRUCTIONS

1. Contact www.connexor.com to obtain a license for the English and Spanish parsers.

2. Update the host/port in the files fdges-client.pl (for Spanish) and fdgen-client.pl (for English). The current values
should look like this for fdges-client.pl:

$remote_host="cheesecake.umiacs.umd.edu"

$remote_port="11720"

and as follows for fdgen-client.pl:

$remote_host="cheesecake.umiacs.umd.edu"

$remote_port="11721"

SAMPLE RUN

> matador.pl test out2 x params=params.matador.2
parameter params = params.matador.2 ... loading

Processing Batch #0
PARSING...

TRANSLATING...

reading /fs/clip-plus/habash/PACKAGE/TRANSLEX/Spanish-English/span-eng.tralex ... done

translating torotemp.dep ...

done

CONVERTING, EXPANDING...

/fs/clip-plus/habash/PACKAGE/EXERGE/corexerge.sh torotemp.trans.amr torotemp.out.amr
T NIL T T T 10 10 10 10 NIL 1 T NIL T
<Running CorExerge>

; Exiting Lisp
LINEARIZAING...
<Running OxyGen 2.0>

; Exiting Lisp
RANKING...

/fs/clip-plus/habash/PACKAGE/EXERGE/halogenize torotemp.out.gls torotemp.out.txt 6
/fs/clip-plus/habash/PACKAGE/HALOGEN/ForestRanker/news.binlm

<GLS-to-Forest Conversion> && <Running HALOGEN>

/fs/clip-plus/habash/HALOGEN/ForestRanker/polishsen.pl
/fs/clip-plus/habash/PACKAGE/MATADOR/halolin-temp.sen0
> /fs/clip-plus/habash/PACKAGE/MATADOR/halolin-temp.sen
; cpu time (non-gc) 420 msec user, 10 msec system

; cpu time (gc) 70 msec user, 0 msec system

; cpu time (total) 490 msec user, 10 msec system

; real time 23,139 msec

; space allocation:

; 332,622 cons cells, 7,882,064 other bytes, 0 static bytes
; Exiting Lisp

REPORTING...
done!